In [118]:
from fastai.vision.all import *
import torch
import pandas as pd
import platform
import os
import subprocess
import pickle
import importlib
import ml_utils
from skimage.segmentation import mark_boundaries
from torchvision import models, transforms
from lime import lime_image
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
In [1]:
VERBOSE = False
In [2]:
if VERBOSE:
    print(os.name)

    print(platform.system())
    print(platform.release())
    print(platform.version())

    print(platform.machine())
    print(platform.processor())
    cuda_available = torch.cuda.is_available()
    print("CUDA is available:", cuda_available)

    if cuda_available:
        # Check if PyTorch is currently using CUDA
        is_using_cuda = torch.cuda.is_initialized()
        print("PyTorch is using CUDA:", is_using_cuda)

        # Get the name of your GPU
        gpu_name = torch.cuda.get_device_name(0)
        print("GPU:", gpu_name)
        print(f"PyTorch version: {torch.version.cuda}")
    else:
        raise Exception("CUDA is not available")
In [151]:
def get_cpu_info():
    if platform.system() == "Windows":
        cpu_model = platform.processor()
    elif platform.system() == "Darwin":
        cpu_model = subprocess.check_output(["sysctl", "-n", "machdep.cpu.brand_string"]).strip().decode()
    elif platform.system() == "Linux":
        command = "cat /proc/cpuinfo | grep 'model name' | head -1"
        cpu_model = subprocess.check_output(command, shell=True).strip().decode().split(": ")[1]
    else:
        cpu_model = "Unknown CPU"
    
    num_cores = os.cpu_count()
    return cpu_model, num_cores

Fine Tuning a ResNet Model for Image Classification¶

Our goal is to train a classification model capable of identifying different kinds of mushrooms using a pretrained ResNet model. This transfer learning approach is effective with limited data, leveraging the knowledge embedded in ResNet50 which was trained on much larger datasets.

  • The model is trained with around 5000 images, applying conservative augmentation techniques.

  • Initially, the model is tuned with its early convolutional layers frozen. These layers capture universal visual features like edges, textures, and basic shapes, with filters trained to detect low-level characteristics useful across various visual tasks. Freezing prevents their weights from updating, restricting the "learning" to class-specific features without altering the general image representations already learned by ResNet50, which minimizes the risk of overfitting. E.g., the structure of ResNet50:

    • An input layer processing images,

    • A 7x7 convolutional layer with 64 filters and a 3x3 max pooling layer, which together extract basic visual features like edges and textures.

    • A sequence of residual blocks:

      • Conv Block 1: Three layers with 64, 64, 256 filters, repeated three times.
      • Conv Block 2: Increases filters to 128, 128, 512, repeated four times.
      • Conv Block 3: Escalates to 256, 256, 1024, with six repetitions.
      • Conv Block 4: Peaks at 512, 512, 2048 filters, repeated three times.
    • An average pooling layer: Reduces feature dimensionality.
    • A fully connected layer: Executes the final classification.

Conv Block 4 and the Fully Connected Layer are unfrozen during training to allow fine-tuning to dataset-specific features.

  • Subsequently, the entire model is unfrozen for more comprehensive fine-tuning, including the pre-trained weights, over N epochs (using an early stopping callback to stop training after the validation loss stops decreasing).

Backpropagation updates weights based on loss gradients, supplemented by learning rate adjustments such as annealing to refine updates without significant deviations from the pretrained configuration.

These layers adapt their filters to better represent the unique features of our mushroom dataset (in theory this involves more complex feature interactions than those in the original training dataset, e.g. ImageNet).

Model Selection¶

ResNet18 - 18 layers deep, with fewer filters and layers compared to ResNet34 and ResNet50. Specifically, it has 2 blocks in each of the four sets of its convolutional layers. ~11.7 million parameters.

Achieved up to F1= ~0.91 on the validation set

--

ResNet34 - 34 layers total; it has 3, 4, 6, and 3 blocks in the four sets of its convolutional layers. ~21.8 million params

up to F1= ~0.92

--

ResNet50 - introduces bottleneck layers to reduce the computational burden. It has a different block structure with 1x1, 3x3, and 1x1 convolutions where the 1x1 layers are responsible for reducing and then increasing (restoring) dimensions, keeping the 3x3 layer a bottleneck with fewer input/output dimensions. ~25.6 million params.

up to F1 = ~0.95
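The bottleneck design described above can be sketched as a small PyTorch module (a simplified version of the real block, which additionally handles stride and channel changes on the skip connection):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Simplified ResNet50-style bottleneck: 1x1 reduce -> 3x3 -> 1x1 restore."""
    def __init__(self, channels, bottleneck_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, bottleneck_channels, 1, bias=False),  # reduce dims
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(inplace=True),
            # The 3x3 conv operates on the reduced ("bottlenecked") channels
            nn.Conv2d(bottleneck_channels, bottleneck_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, channels, 1, bias=False),  # restore dims
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.block(x) + x)  # residual (skip) connection

# Shape matches e.g. the first stage of ResNet50 (256 channels, 56x56 maps)
x = torch.randn(1, 256, 56, 56)
out = Bottleneck(256, 64)(x)
print(out.shape)
```

The 1x1 convolutions keep the expensive 3x3 convolution operating on 64 channels instead of 256, which is what makes the deeper ResNet50 computationally tractable.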

Parameter Selection and Tuning¶

We've performed extensive tuning for the model (~50 trials for selecting appropriate hyperparameter ranges and ~150 trials for selecting the optimal parameters) using Bayesian tuning, so that will be used as the basis of our model. However, most of the parameters don't seem to have a significant impact besides:

  • pct_start defines the fraction of the training cycle spent increasing the learning rate, impacting the speed and effectiveness of neural network training adjustments.
  • augmentations (i.e. image transformations like changing the scale, size, or rotation of the image, erasing parts of the image, etc.). One issue with our results is that we tuned our model using discrete sets of transformations instead of tuning individual parameters.
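For reference, fastai uses pct_start both in fit_one_cycle (fraction of steps spent increasing the LR) and in fit_flat_cos (fraction spent flat before the cosine decay); since flat_cos is the scheduler we ended up selecting, here is a pure-Python sketch of that shape (an approximation of the schedule, not fastai's exact implementation):

```python
import math

def flat_cos_lr(step, total_steps, base_lr, pct_start):
    """Approximate shape of fastai's fit_flat_cos schedule:
    flat at base_lr for the first pct_start of training, then
    cosine-anneal down toward zero."""
    frac = step / total_steps
    if frac < pct_start:
        return base_lr
    # Map the remaining fraction of training onto a half cosine from 1 to 0
    t = (frac - pct_start) / (1 - pct_start)
    return base_lr * (1 + math.cos(math.pi * t)) / 2

base_lr, pct_start, total = 1.4e-3, 0.4, 100
print(flat_cos_lr(0, total, base_lr, pct_start))    # flat phase: base_lr
print(flat_cos_lr(70, total, base_lr, pct_start))   # mid-anneal: half of base_lr
print(flat_cos_lr(100, total, base_lr, pct_start))  # end: ~0
```

With the selected pct_start of ~0.4, roughly the first 40% of each fit runs at the full base learning rate before the decay begins.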

Additionally:

  • We've found that using class weighting to handle the imbalance in the dataset had no or limited effect, so we're not employing any over- or under-sampling technique (generating additional synthetic samples might be an option that could be explored).
  • Tuning was only performed for the ResNet50 model
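The class weighting we experimented with amounts to inverse-frequency weights passed to the loss function; a minimal sketch with made-up class counts (the exact implementation inside ml_utils may differ):

```python
def inverse_frequency_weights(class_counts):
    """Weight each class inversely to its frequency. Normalized so that
    a perfectly balanced dataset gives every class a weight of 1;
    rarer classes contribute proportionally more to the loss."""
    total = sum(class_counts)
    n_classes = len(class_counts)
    return [total / (n_classes * c) for c in class_counts]

# Hypothetical counts for an imbalanced 3-class dataset
weights = inverse_frequency_weights([600, 300, 100])
print(weights)  # the rarest class gets the largest weight
# These would typically be passed to the loss, e.g.
# nn.CrossEntropyLoss(weight=torch.tensor(weights))
```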
Selected Parameters:¶
In [6]:
with open("studies/main_study.pkl", "rb") as f:
    loaded_study = pickle.load(f)
selected_trial = loaded_study.trials[84]
In [7]:
seected_params = selected_trial.params
seected_params
Out[7]:
{'batch_size': 64,
 'base_lr': 0.0014050114695105182,
 'weight_decay': 0.09308356639366534,
 'lr_mult': 10,
 'lr_scheduler': 'flat_cos',
 'freeze_epochs': 6,
 'pct_start': 0.39954565320066476,
 'aug_mode': 'mult_1.25_more_trans'}
Model fine tuning and training¶

We've selected ResNet50 as our final "production" model because we were able to achieve significantly better performance with it. However, depending on the application and technical constraints this might not be the optimal choice:

  • Tuning/training in a reasonable amount of time requires a relatively recent GPU with at least 16 GB or so of memory.
  • However, relative to more modern deep learning models (especially LLMs), memory requirements for inference are low: they shouldn't exceed a few hundred MB even for ResNet50, with single-image latency of 50-100 ms or so even on CPUs.
  • This becomes much more important if we're working with live recognition/video/AR rather than individual static images. In that case the 3-5x performance difference might become very significant when running on non-high-end desktop/server hardware; e.g., AR and similar apps on mobile devices would generally use ResNet18, ResNet34, or more likely shallower models like MobileNet or EfficientNet, which have lower parameter counts and are faster.
In [8]:
importlib.reload(ml_utils)

DS_PATH = "dataset"

batch_size = 128  # seected_params["batch_size"]
base_lr = seected_params["base_lr"]
weight_decay = seected_params["weight_decay"]
lr_mult = seected_params["lr_mult"]
lr_scheduler_type = seected_params["lr_scheduler"]
freeze_epochs = seected_params["freeze_epochs"]
pct_start = seected_params["pct_start"]

balancing = None

aug_mode = "mult_1.25_more_trans"
# aug_mode = "mult_1.25_more_trans"

num_epochs = 25
freeze_epochs = 6

data = ml_utils.create_dataloaders(batch_size, aug_mode=aug_mode, path=DS_PATH)

learn = ml_utils.create_learner(
    data, weight_decay, balancing, model_type=ml_utils.ModelType.RESNET50
)

callbacks = [ml_utils.EarlyStoppingCallback(monitor="valid_loss", patience=4)]
Augmentation Mode: mult_1.25_more_trans | Data Type: Train
using RESNET50
<function resnet50 at 0x70262070eb90>
In [9]:
importlib.reload(ml_utils)
learn = ml_utils.train_model(
    learn=learn,
    base_lr=base_lr,
    lr_mult=lr_mult,
    lr_scheduler_type=lr_scheduler_type,
    freeze_epochs=freeze_epochs,
    num_epochs=num_epochs,
    pct_start=pct_start,
    callbacks=callbacks,
)
metrics = ml_utils.extract_metrics(learn)
Model is on CUDA
epoch train_loss valid_loss accuracy weighted_f1 micro_f1 macro_f1 time
0 2.201875 1.372627 0.660160 0.649471 0.660160 0.599223 00:14
1 1.802115 1.206056 0.704525 0.698271 0.704525 0.659144 00:13
2 1.549524 1.134944 0.745342 0.740111 0.745342 0.710000 00:13
3 1.405611 1.135062 0.732032 0.728327 0.732032 0.690126 00:13
4 1.289279 1.090290 0.753327 0.749645 0.753327 0.726165 00:13
5 1.210203 1.057118 0.755102 0.750526 0.755102 0.719891 00:13
epoch train_loss valid_loss accuracy weighted_f1 micro_f1 macro_f1 time
0 1.095013 0.995483 0.819876 0.818026 0.819876 0.787816 00:16
1 0.925314 0.862077 0.858917 0.858179 0.858917 0.832653 00:16
2 0.829193 0.810338 0.877551 0.876764 0.877551 0.848774 00:16
3 0.755437 0.785751 0.881988 0.880593 0.881988 0.861716 00:16
4 0.703322 0.750437 0.900621 0.899623 0.900621 0.882583 00:15
5 0.676031 0.755283 0.891748 0.891298 0.891748 0.875207 00:16
6 0.643748 0.727963 0.910382 0.909308 0.910382 0.892095 00:16
7 0.621263 0.700396 0.921029 0.920402 0.921029 0.906657 00:16
8 0.610846 0.709269 0.919255 0.918628 0.919255 0.904956 00:16
9 0.597556 0.734286 0.888199 0.886879 0.888199 0.871929 00:16
10 0.592544 0.698643 0.916593 0.915706 0.916593 0.900124 00:16
11 0.591604 0.691277 0.918367 0.917716 0.918367 0.899072 00:16
12 0.585929 0.687158 0.924579 0.923806 0.924579 0.908399 00:16
13 0.576905 0.680047 0.921029 0.920304 0.921029 0.908931 00:16
14 0.565662 0.680570 0.921029 0.919747 0.921029 0.896938 00:16
15 0.553509 0.652810 0.937001 0.936754 0.937001 0.923529 00:16
16 0.545040 0.652485 0.933452 0.933213 0.933452 0.924044 00:16
17 0.538019 0.641254 0.935226 0.935154 0.935226 0.926492 00:16
18 0.533519 0.632918 0.943212 0.942863 0.943212 0.933724 00:16
19 0.527825 0.631027 0.940550 0.940248 0.940550 0.925694 00:16
20 0.525236 0.624010 0.943212 0.943094 0.943212 0.929062 00:15
21 0.519706 0.623547 0.944987 0.944514 0.944987 0.933056 00:16
22 0.520856 0.620551 0.943212 0.942946 0.943212 0.930493 00:16
23 0.518598 0.620019 0.944987 0.944672 0.944987 0.930896 00:16
24 0.516994 0.619290 0.947649 0.947328 0.947649 0.935211 00:16
In [10]:
learn.recorder.plot_loss()
Out[10]:
<Axes: title={'center': 'learning curve'}, xlabel='steps', ylabel='loss'>
[learning curve plot]

Overfitting does not seem to be a significant issue, validation loss was decreasing during the entire training process and the difference between train and validation loss is relatively low.

The table below shows the classification metrics on the full training, validation, and test samples with all augmentations turned off.

We are using a separate test sample because the validation sample was used during hyperparameter tuning, so the selected parameters and augmentation options might be indirectly "overfitted" on it. However, considering that the dataset is relatively small, using just two evaluation samples might be sufficient (ideally we'd also use CV).

In [11]:
importlib.reload(ml_utils)
train_data = ml_utils.create_dataloaders(
    batch_size, aug_mode=None, path=DS_PATH, is_test=False, valid_pct=0.2
)

train_dl = train_data.train
valid_dl = train_data.valid
test_files = get_image_files(f"{DS_PATH}/test")
test_dl = train_data.test_dl(test_files, with_labels=True)

learn.dls = train_data

train_results = learn.validate(dl=train_dl)

valid_results = learn.validate(dl=valid_dl)

test_results = learn.validate(dl=test_dl)

n_train = len(train_dl.dataset)
n_valid = len(valid_dl.dataset)
n_test = len(test_dl.dataset)

data = {
    "Dataset": ["Train", "Validation", "Test"],
    "N": [n_train, n_valid, n_test],
    "Loss": [train_results[0], valid_results[0], test_results[0]],
    "Accuracy": [train_results[1], valid_results[1], test_results[1]],
    "Weighted F1": [train_results[2], valid_results[2], test_results[2]],
    "Micro F1": [train_results[3], valid_results[3], test_results[3]],
    "Macro F1": [train_results[4], valid_results[4], test_results[4]],
}

results_df = pd.DataFrame(data)
Augmentation Mode: None | Data Type: Train
No augmentations applied.
In [12]:
results_df
Out[12]:
Dataset N Loss Accuracy Weighted F1 Micro F1 Macro F1
0 Train 4512 0.493925 1.000000 1.000000 1.000000 1.000000
1 Validation 1127 0.619294 0.947649 0.947328 0.947649 0.935211
2 Test 996 0.613501 0.947791 0.947677 0.947791 0.945465

By Class Performance¶

In [13]:
from fastai.vision.all import *
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

importlib.reload(ml_utils)
device = "cuda" if torch.cuda.is_available() else "cpu"
learn.to(device)  # Ensure the model is on the correct device

test_files = get_image_files(f"{DS_PATH}/test")


def label_func(x):
    return x.parent.name


tst_dl = ml_utils.create_dataloaders(
    batch_size, aug_mode=aug_mode, path=DS_PATH, is_test=True
)
preds, targs = learn.get_preds(dl=tst_dl)
preds = preds.argmax(dim=1)
Augmentation Mode: mult_1.25_more_trans | Data Type: Test
No augmentations applied.
In [14]:
preds, targs = preds.cpu(), targs.cpu()
precision, recall, f1, support = precision_recall_fscore_support(
    targs, preds, average=None
)
metrics_df = pd.DataFrame(
    {
        "Class": test_dl.vocab,
        "Precision": precision,
        "Recall": recall,
        "F1 Score": f1,
        "Count": support,
    }
)
test_summary_df = pd.concat([metrics_df], ignore_index=True)
In [15]:
test_summary_df
Out[15]:
Class Precision Recall F1 Score Count
0 Agaricus 0.921569 0.886792 0.903846 53
1 Amanita 0.937500 0.937500 0.937500 112
2 Boletus 0.958084 0.993789 0.975610 161
3 Cortinarius 0.928000 0.928000 0.928000 125
4 Entoloma 0.903846 0.854545 0.878505 55
5 Hygrocybe 1.000000 0.978723 0.989247 47
6 Lactarius 0.938053 0.942222 0.940133 225
7 Russula 0.952941 0.947368 0.950147 171
8 Suillus 0.914894 0.914894 0.914894 47
In [16]:
interp = ClassificationInterpretation.from_learner(learn, dl=tst_dl)
interp.plot_confusion_matrix()
[confusion matrix plot]

Largest Losses (i.e. worst predictions)¶

In [21]:
interp.plot_top_losses(k=20)
[grid of the 20 highest-loss predictions]
Smallest Losses (i.e. best predictions)¶
In [23]:
interp.plot_top_losses(k=20, largest=False)
[grid of the 20 lowest-loss predictions]
Most Confused Mushroom Types (actual, predicted, n. occurrences)¶
In [24]:
most_confused_pairs = interp.most_confused()
In [25]:
[p for p in most_confused_pairs if p[2] > 1]
Out[25]:
[('Entoloma', 'Lactarius', 5),
 ('Lactarius', 'Russula', 5),
 ('Russula', 'Lactarius', 5),
 ('Amanita', 'Cortinarius', 3),
 ('Agaricus', 'Entoloma', 2),
 ('Amanita', 'Agaricus', 2),
 ('Boletus', 'Cortinarius', 2),
 ('Cortinarius', 'Amanita', 2),
 ('Cortinarius', 'Boletus', 2),
 ('Cortinarius', 'Lactarius', 2),
 ('Cortinarius', 'Russula', 2),
 ('Lactarius', 'Amanita', 2),
 ('Lactarius', 'Entoloma', 2),
 ('Russula', 'Agaricus', 2),
 ('Russula', 'Boletus', 2),
 ('Suillus', 'Boletus', 2)]

Explaining Predictions using LIME¶

LIME (Local Interpretable Model-agnostic Explanations) works by creating interpretable models around the predictions made by a complex model like ResNet. Basically, it generates a very large sample of "perturbed" images and observes how the predictions change across them; this allows it to highlight the areas the main model's decision was based on.

In [59]:
def get_dl_file_name_by_index(dl, index):
    items = dl.items
    return items[index]


def find_index_by_file_name(dl, file_name):
    for i, item in enumerate(dl.items):
        if file_name in str(item):
            return i

    return None
# find_index_by_file_name(tst_dl, "101_ocRZyv2hUFg.jpg")
# find_index_by_file_name(tst_dl, "331_31Dp3yn9rZs.jpg")
# find_index_by_file_name(tst_dl, "162_7Dc7z0eaPkw.jpg")
# find_index_by_file_name(tst_dl, "215_XXWsMnFz1MI.jpg")
# find_index_by_file_name(tst_dl, "046_4iVp26aGB-M.jpg")
In [90]:
indices = [504, 1, 101, 514, 477, 754] # Add 110, 893 for more
interp.show_results(indices)
[predictions for the selected test images]
In [ ]:
class_names = test_dl.vocab
In [103]:
# GPU support enabled 
image_path = get_dl_file_name_by_index(tst_dl, 504)

pytorch_model = learn.model
pytorch_model.eval()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")
pytorch_model.to(device)

def get_input_transform():
    normalize = Normalize.from_stats(*imagenet_stats)
    transf = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        normalize,
    ])
    return transf

def get_input_tensor(image_path, device=device):
    img = Image.open(image_path).convert("RGB")
    tensor = get_input_transform()(img).unsqueeze(0)  # Add batch dimension
    return tensor.to(device)

def batch_predict(images, device=device):
    learn.model.eval()
    batch = torch.stack(
        [get_input_transform()(Image.fromarray(img.astype("uint8"))) for img in images],
        dim=0,
    )
    batch = batch.to(device)
    with torch.no_grad():
        logits = learn.model(batch)
    probs = F.softmax(logits, dim=1)
    return probs.detach().cpu().numpy()
Device: cuda
In [100]:
lime_explanations = {}
for idx in indices:
    image_path = get_dl_file_name_by_index(tst_dl, idx)
    print(f"{idx} - > {image_path}")
    img_tensor = get_input_tensor(image_path, device=device)

    logits = pytorch_model(img_tensor)
    probs = F.softmax(logits, dim=1)
    
    explainer = lime_image.LimeImageExplainer()
    explanation = explainer.explain_instance(
        np.array(Image.open(image_path).convert("RGB")),
        batch_predict, 
        top_labels=1,
        hide_color=0,
        num_samples=1000,
    )
    lime_explanations[idx] = (img_tensor, probs, explanation, image_path)
In [106]:
for idx, lime_data in lime_explanations.items():

    img_tensor, probs, explanation, image_path = lime_data
    label_to_explain = explanation.top_labels[0]  # the most likely label index
    temp, mask = explanation.get_image_and_mask(
        label_to_explain,
        positive_only=False,
        negative_only=False,
        num_features=15,
        hide_rest=False,
    )

    actual_label=""
    img_boundary = mark_boundaries(temp / 255.0, mask)
    plt.imshow(img_boundary)
    plt.title(
        f"Top predicted class: {class_names[label_to_explain]} - Probability: {probs[0][label_to_explain].item():.3f}{actual_label}"
    )
    plt.show()
    for i, prob in enumerate(probs[0]):
        print(f"{class_names[i]}: {prob.item():.3f}")
[LIME explanation: predicted Boletus]
Agaricus: 0.060
Amanita: 0.049
Boletus: 0.751
Cortinarius: 0.019
Entoloma: 0.021
Hygrocybe: 0.020
Lactarius: 0.040
Russula: 0.019
Suillus: 0.019
[LIME explanation: predicted Russula]
Agaricus: 0.017
Amanita: 0.046
Boletus: 0.021
Cortinarius: 0.015
Entoloma: 0.022
Hygrocybe: 0.021
Lactarius: 0.030
Russula: 0.814
Suillus: 0.013
[LIME explanation: predicted Boletus]
Agaricus: 0.010
Amanita: 0.010
Boletus: 0.916
Cortinarius: 0.010
Entoloma: 0.011
Hygrocybe: 0.015
Lactarius: 0.010
Russula: 0.010
Suillus: 0.008
[LIME explanation: predicted Agaricus]
Agaricus: 0.900
Amanita: 0.008
Boletus: 0.011
Cortinarius: 0.007
Entoloma: 0.034
Hygrocybe: 0.010
Lactarius: 0.005
Russula: 0.012
Suillus: 0.013
[LIME explanation: predicted Amanita]
Agaricus: 0.012
Amanita: 0.882
Boletus: 0.012
Cortinarius: 0.013
Entoloma: 0.016
Hygrocybe: 0.015
Lactarius: 0.015
Russula: 0.021
Suillus: 0.016
[LIME explanation: predicted Hygrocybe]
Agaricus: 0.013
Amanita: 0.009
Boletus: 0.011
Cortinarius: 0.007
Entoloma: 0.009
Hygrocybe: 0.923
Lactarius: 0.006
Russula: 0.010
Suillus: 0.013

Inference Performance¶

The table below shows inference performance depending on the number of images in a single batch. We can see that inference on the GPU is much more scalable and allows evaluating samples in parallel.

In [109]:
import time
import psutil
import pandas as pd
import torch
from fastai.vision.all import *

def measure_inference_speed(learn, dl, num_samples, device):
    # Prepare a batch of data
    data_batch = next(iter(dl))
    inputs, _ = data_batch
    inputs = inputs[:num_samples].to(device)
    
    # Measure memory usage before inference
    if device == 'cuda':
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats(device)
    else:
        process = psutil.Process()
        mem_before = process.memory_info().rss / 1024 ** 2  # Memory in MB

    # Measure inference time
    start_time = time.time()
    learn.model.eval()  # Set the model to evaluation mode
    with torch.no_grad():
        outputs = learn.model(inputs)
    end_time = time.time()
    
    # Measure memory usage after inference
    if device == 'cuda':
        peak_mem_usage = torch.cuda.max_memory_allocated(device) / 1024 ** 2  # Peak memory in MB
    else:
        process = psutil.Process()
        mem_after = process.memory_info().rss / 1024 ** 2  # Memory in MB
        peak_mem_usage = mem_after - mem_before
    
    # Calculate metrics
    total_time = end_time - start_time
    
    return total_time, peak_mem_usage

# Function to test inference speed on both CPU and GPU
def test_inference_speed(learn, dl, sample_sizes):
    results = []
    devices = ["cpu", "cuda"] if torch.cuda.is_available() else ["cpu"]
    
    for device in devices:
        learn.model.to(device)
        for num_samples in sample_sizes:
            total_time, memory_usage = measure_inference_speed(learn, dl, num_samples, device)
            results.append({
                "Device": device,
                "Sample Size": num_samples,
                "Total Inference Time (s)": total_time,
                "Peak Memory Usage (MB)": memory_usage
            })
    
    return pd.DataFrame(results)
In [137]:
import time
import psutil
import pandas as pd
import torch
from fastai.vision.all import *

def measure_inference_speed(learn, dl, num_samples, device):
    start_time = time.time()

    # Prepare a batch of data
    data_batch = next(iter(dl))
    inputs, _ = data_batch
    inputs = inputs[:num_samples].to(device)

    process = psutil.Process()

    if device == 'cuda':
        torch.cuda.empty_cache()
        baseline_mem = torch.cuda.memory_reserved(device) / 1024 ** 2  # Baseline memory in MB
        torch.cuda.reset_peak_memory_stats(device)
    else:
        baseline_mem = process.memory_info().rss / 1024 ** 2  # Baseline memory in MB

    learn.model.eval()  # Set the model to evaluation mode
    
    start_inference_time = time.time()
    with torch.no_grad():
        outputs = learn.model(inputs)
        if device == 'cuda':
            peak_mem_usage = (torch.cuda.max_memory_reserved(device) / 1024 ** 2) - baseline_mem  # Peak memory during inference
        else:
            peak_mem_usage = (process.memory_info().rss / 1024 ** 2) - baseline_mem  # Peak memory during inference
    end_time = time.time()
    
    # Calculate metrics
    total_time = end_time - start_time
    only_inference_time = end_time - start_inference_time
    
    return total_time, only_inference_time, peak_mem_usage

def test_inference_speed(learn, dl, sample_sizes):
    results = []
    devices = ["cpu", "cuda"] if torch.cuda.is_available() else ["cpu"]
    
    for device in devices:
        learn.model.to(device)
        for num_samples in sample_sizes:
            total_time, only_inference_time, memory_usage = measure_inference_speed(learn, dl, num_samples, device)
            results.append({
                "Device": device,
                "Sample Size": num_samples,
                "Total Time (s)": total_time,
                "Inference Time (s)": only_inference_time,
                "Peak Memory Usage (MB)": memory_usage
            })
    
    return pd.DataFrame(results)
In [140]:
# Assuming `learn` is already defined and loaded with the model
sample_sizes = [1, 10, 50, 100, 500]
inference_df = test_inference_speed(learn, valid_dl, sample_sizes)
In [149]:
inference_df["Per Image (ms)"] = ((inference_df["Inference Time (s)"] / inference_df["Sample Size"])*1000).round(2)
In [150]:
inference_df
Out[150]:
Device Sample Size Total Time (s) Inference Time (s) Peak Memory Usage (MB) Per Image (ms)
0 cpu 1 3.158689 0.287251 0.585938 287.25
1 cpu 10 3.887971 1.122948 -4.468750 112.29
2 cpu 50 10.299034 7.485476 -1.796875 149.71
3 cpu 100 14.920235 11.995297 2.695312 119.95
4 cpu 500 16.881310 13.978759 4.289062 27.96
5 cuda 1 2.920804 0.015073 2.000000 15.07
6 cuda 10 3.015016 0.016202 2.000000 1.62
7 cuda 50 2.940215 0.016177 308.000000 0.32
8 cuda 100 3.028011 0.016233 924.000000 0.16
9 cuda 500 3.133774 0.018235 1572.000000 0.04

On a GPU, inference itself seems to be more or less instantaneous, with most time spent loading the image into memory. In this specific configuration CPU inference is several hundred times slower. This indicates that ResNet50 isn't really suitable for CPU inference; we should choose ResNet18 or ResNet34 if that's a requirement.
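One caveat about the GPU timings above: CUDA kernel launches are asynchronous, so timing learn.model(inputs) without a synchronization point may measure mostly launch overhead rather than full kernel execution. A sketch of a pattern that accounts for this (the tiny stand-in model is only there to keep the example self-contained and runnable on CPU):

```python
import time
import torch

def timed_forward(model, inputs, device):
    """Time a forward pass, making sure queued CUDA work has finished
    both before starting and before stopping the clock (no-op on CPU)."""
    model.eval()
    if device == "cuda":
        torch.cuda.synchronize()  # drain any previously queued work
    start = time.perf_counter()
    with torch.no_grad():
        out = model(inputs)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the forward pass to complete
    return out, time.perf_counter() - start

# Tiny stand-in model so the sketch runs anywhere
model = torch.nn.Linear(8, 2)
out, elapsed = timed_forward(model, torch.randn(4, 8), "cpu")
print(out.shape, elapsed)
```

With synchronization in place, GPU per-image latencies would likely grow somewhat with batch size instead of staying flat, though the CPU/GPU gap would remain large.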

Device info:

In [152]:
display(f"GPU: {torch.cuda.get_device_name(0)}, CPU: {get_cpu_info()}")
"GPU: NVIDIA GeForce RTX 3090, CPU: ('AMD EPYC 7702P 64-Core Processor', 128)"

Main Observations¶

Potential Issues¶

  • Model parameters were tuned using Optuna and multiple sessions:

    • 40 trials for eliminating the least useful parameter values and ranges that resulted in significant performance degradation.
    • 130 trials for final tuning.
  • The tuning process took around X.X hours on an RTX 3090, significantly improving performance from the baseline case (using cnn_learner + lr_find to find the optimal learning rate by briefly training the model on a range of learning rates). From F1 ~= 0.915 to ~= 0.965 after:

    • Selecting optimal batch_size, weight_decay, freeze_epochs, and pct_start values.
    • Changing LR scheduler to flat_cos.
    • Adding additional augmentation transformations like: rotations, scaling, zooming, flipping, random erasing, warping, etc.
    • The tuning process was very slow and inefficient; we should probably be able to improve this by using a more aggressive pruning strategy and combining it with lr_find instead of Bayesian optimization for selecting the optimal LR:
      • Additionally, we only used fixed sets of augmentations which limited the search space; we should consider tuning individual augmentation parameters as well.
Overfitting¶
  • Direct overfitting does not seem to have been a significant concern; the difference between train_loss and valid_loss was at most ~0.065.
  • The model was hyperparameter-tuned with Optuna using a fixed train-validation split and a fixed seed for training. This is not ideal because the sample is very small, and it likely resulted in Optuna indirectly overfitting on the validation set when selecting the optimal parameters. Ideally we'd use CV, but that would have extended the tuning time significantly.

Future Improvements:¶

  • Improve sample selection and filtering, and test performance on different subsamples. We've found that there is a lot of variance between images of the same class (e.g. different backgrounds, zoom levels, or composition, like multiple vs. individual mushrooms); we could use this to potentially select more optimal training samples.